Introduction


In this work, we took on the task of both predicting and quickly dissecting the NYC 311 call volume data set. There are several viable ways to make inroads into this work, including triaging by Agency and Complaint Type. Ultimately, we were interested in teasing out macro behaviors and began our explorations without any strong prior bias. Certain salient aspects were easily distilled: annual seasonality (which differs by agency), the importance of the day of the week, and intra-day patterns. Our team successfully applied random forest regression to predict next-day call volume by complaint type; the predictions were largely driven by the day of the week, weather and trailing call volume. This process revealed several clusters of agency-complaint pairs with high correlation. Departing from the need to distill macro behaviors, we also explored the behavior of one rather unique complaint: Graffiti. This complaint hints at certain societal and generational differences in norms.


We plot a time series of complaints by agency and by year. The plot reveals a seasonal pattern that repeats each year: the number of complaints increases in winter and decreases in summer.


Additionally, we explored the distribution of complaints per day over the entire time frame to see whether any obvious trends hint at a concentration on particular days. The calendar chart helps us visualize the distribution not just per day but per month of every year, to detect any seasonal patterns.

Our second question is whether the number of complaints can be explained by geographical information, particularly where the complaints occur. The following map shows May 2015; a large number of complaints are concentrated in downtown and uptown Manhattan.


Now we know that time and position are important factors in the number of requests. However, additional variables or interactions, such as the borough and the type of request, can also help explain the behavior of our complaints.
Using GoogleViz, we can combine our plots to create an interactive dashboard. Our first conclusion is that the top ten complaint types account for more than fifty percent of all 311 complaints. In addition, we use a treemap to represent the number of complaints by borough. It shows that complaints in Queens are mostly related to street conditions, while those in the other boroughs are mostly about heating. You can click on a subject of interest to drill down, and right-click on the gray header to return.


The calendar heatmap plotted next is an efficient way of visualizing not only the seasonal variation in the number of complaints but also the variation by day of the week. It is quite evident that the number of complaints in the winter months is much higher than in the summer months. It is also interesting to note that weekends (Saturday and Sunday) have consistently fewer complaints throughout our period of observation.

Daywise Complaint Distribution


In the sections above, the complaints were explored across all the years, then across the months of one year, then across the days of the week. The next step was to understand how complaint frequency varies by hour over the course of a day for the top agencies. An animated GIF was created to illustrate the average variability of complaints throughout the course of a day. To create the plot, we aggregated complaints by hour over all days in 2015. Thus, the resulting graphic shows the total number of complaints for the top six agencies (by complaint volume) by hour in 2015. Each of the six agencies has a map displaying the geographic location of complaints, as well as a histogram displaying the number of calls per hour. Unsurprisingly, all agencies except the NYPD see an increase in call volume throughout the daytime and a sharp decrease at night. Complaints to the NYPD actually peak between 10 pm and 1 am, possibly due to the nature of the complaints they handle: noise, burglaries and assaults (to name a few), all of which occur more frequently at night.
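
The hourly aggregation behind such a plot can be sketched in a few lines of base R. This is only an illustration on toy data: the column names `agency` and `created`, the UTC time zone and the choice of two agencies (instead of six) are assumptions made for the example, not the code we used for the GIF.

```r
# Toy stand-in for the 311 extract; column names are assumptions.
calls <- data.frame(
  agency  = c("NYPD", "NYPD", "HPD", "HPD", "HPD", "DOT"),
  created = as.POSIXct(c("2015-01-01 23:10:00", "2015-01-02 00:40:00",
                         "2015-01-01 09:05:00", "2015-01-01 09:50:00",
                         "2015-01-02 14:20:00", "2015-01-03 11:00:00"),
                       tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hour of day (0-23) for each complaint.
calls$hour <- as.integer(format(calls$created, "%H"))

# Keep the busiest agencies (six in the text; two for this toy data).
n_top <- 2
top_agencies <- names(sort(table(calls$agency), decreasing = TRUE))[seq_len(n_top)]
top_calls <- calls[calls$agency %in% top_agencies, ]

# Agency-by-hour counts; each row is one agency's intra-day profile.
hourly <- table(agency = top_calls$agency,
                hour   = factor(top_calls$hour, levels = 0:23))
```

Each row of `hourly` can then be drawn as one histogram panel, with one frame per hour if an animation is wanted.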

Predictive Analysis of 311 Data

Motivation

If we take a high-level view of the New York 311 call data, there is a dizzying array of ways to slice and organize it. This stymies the goal of prediction until an appropriate direction is chosen. Perhaps one natural and endogenous variable to predict is the call volume on a temporal and geospatial scale. Our team took a relatively simple approach to build some intuition as to what influences total call volumes. Our analyses were influenced by the work of Zha and Veloso (2014), who successfully applied a Random Forest approach.

Our preliminary regression analysis revolved around predicting call volume on a given day. To train our models we used the following predictors: 1) the total call volume of the past 7 days (numeric), 2) day of the week (categorical), 3) snow days (boolean), 4) inches of precipitation (numeric), 5) mean temperature on the day (numeric), 6) range of temperature on the day (numeric), 7) public holidays (boolean), 8) bank holidays (boolean).
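
As a rough illustration of how the first two predictors can be derived from a table of daily totals, here is a minimal base R sketch. The data frame `daily`, its column names and the choice to exclude the current day from the trailing sum are assumptions for the example.

```r
# Toy daily totals; in practice these would come from the 311 extract.
daily <- data.frame(
  date   = seq(as.Date("2015-01-01"), by = "day", length.out = 10),
  volume = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
)

# Trailing week's volume: sum of the 7 days strictly before each date.
# stats::filter with sides = 1 sums the current day and the 6 before it,
# so shifting the result by one day excludes the day being predicted.
rolling7  <- as.numeric(stats::filter(daily$volume, rep(1, 7), sides = 1))
daily$twv <- c(NA, head(rolling7, -1))

# Day of the week as a categorical predictor.
daily$dow <- factor(weekdays(daily$date))
```

The first seven rows have no complete trailing week and are left as `NA`; they would be dropped before training.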

Linear Regression

Naively, we first fit a linear model to gauge the out-of-the-box efficacy of these predictors. We ran this model for a few agencies, with and without Day of the Week. The comparison highlights the strong power of this predictor and also reveals that a linear model is perhaps not the best approach to exploring the structure of our data.

A clear takeaway is that the categorical variable Day of the Week has a great deal of predictive power. Nonetheless, these results discourage us from using a linear model for prediction. It seems that a more granular approach is required.

Random Forest

In the work by Zha and Veloso (2014), the authors had considerable success adopting a Random Forest approach to predicting call volumes. The Random Forest trains a predefined number of decision trees, each on a bootstrap-style resample of the observed samples. We use decision tree regressors, and the forest then takes the mean of the predictions made by its constituent trees. In the following sections, our trees are trained over data from 2010 to 2014 and then tested against the observed data in 2015.
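
To make the resample-and-average idea concrete, the sketch below bags one-split regression trees (stumps) in base R. It is deliberately not a full random forest: there is a single predictor and no random feature selection, and the data, names and tree count are made up for illustration.

```r
set.seed(1)

# Toy data: volume steps up when trailing volume exceeds 50 (hypothetical).
train <- data.frame(twv = runif(200, 0, 100))
train$volume <- ifelse(train$twv > 50, 6000, 3500) + rnorm(200, sd = 200)

# Fit a one-split regression tree (a stump) on a single numeric predictor.
fit_stump <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2   # midpoints between values
  sse <- sapply(cuts, function(s) {
    left <- y[x <= s]
    right <- y[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- cuts[which.min(sse)]
  list(split = s, left = mean(y[x <= s]), right = mean(y[x > s]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

# Train each tree on a bootstrap resample of the rows.
n_trees <- 50
forest <- lapply(seq_len(n_trees), function(i) {
  idx <- sample(nrow(train), replace = TRUE)
  fit_stump(train$twv[idx], train$volume[idx])
})

# The ensemble prediction is the mean of the individual tree predictions.
predict_forest <- function(forest, x) {
  preds <- sapply(forest, predict_stump, x = x)
  rowMeans(matrix(preds, nrow = length(x)))
}

pred <- predict_forest(forest, c(20, 80))
```

Averaging over many resampled trees is what smooths out the variance of any single tree; it is also why the ensemble tends to be pulled toward typical values on extreme days.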

Single Batch Forest

At first, we train our forest over all call volume in one batch. This has reasonable success but is still somewhat unsatisfactory. As can be seen in the graph below, a forest trained over the aggregate data reveals some clustering in the data but fails to adjust for it accurately.

Even before performing a random forest regression, this plot shows that the call volume itself exhibits some clustering around the 3,500 and 6,000 daily call ranges. In addition to the clusters, we can see that the forest has some difficulty in predicting the extreme values of the distribution.

Feature Importance

One of the advantages of the randomForest package in R is that we can very easily explore the added value of each predictor. This value is typically measured in one of two ways. How much does the prediction error on held-out (out-of-bag) data increase when the predictor's values are randomly permuted? How much does splitting on the predictor reduce node impurity, which for regression is the residual sum of squares? Node purity, while a potential flag for overfitting if abused, is a sign that the tree is mature for prediction.
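
In the randomForest package, `importance()` reports these two measures as `%IncMSE` and `IncNodePurity` for a regression forest. The idea behind the first one can be sketched in base R with any fitted model. Below, `lm` is used purely as a stand-in for the forest, a separate test split stands in for the out-of-bag samples, and the data are simulated.

```r
set.seed(42)

# Toy data: volume depends strongly on x1 and not at all on x2.
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$volume <- 5 * dat$x1 + rnorm(n)
train <- dat[1:200, ]
test  <- dat[201:300, ]

# Stand-in model; any object with a predict() method would work here.
fit <- lm(volume ~ x1 + x2, data = train)

mse <- function(model, newdata) {
  mean((newdata$volume - predict(model, newdata = newdata))^2)
}

# Increase in held-out MSE after shuffling one predictor's values.
perm_importance <- function(model, newdata, var) {
  shuffled <- newdata
  shuffled[[var]] <- sample(shuffled[[var]])
  mse(model, shuffled) - mse(model, newdata)
}

imp <- sapply(c("x1", "x2"), function(v) perm_importance(fit, test, v))
```

A predictor that matters (`x1` here) produces a large jump in error when shuffled, while an irrelevant one (`x2`) leaves the error essentially unchanged.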

This feature importance highlights that 311 exhibits what we might call a “Call-center” phenomenon. The day of the week is paramount to predicting the volume on that day. Next most important is TWV, Trailing Week’s Volume, demonstrating a strong auto-regressive property of this data.

Complaint Type Aggregate Reconstruction

At this point, one may correctly question why we should expect all complaints in NYC to behave in a consistent fashion under our predictors. While these features may indeed be quite explanatory, their interactions may differ across agency or complaint type. In this section, we train a random forest of 250 trees on each of the top 50 complaint types by observed call volume. Our total daily prediction is then the sum of the 50 daily predictions.
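
The fit-per-type-then-sum structure can be sketched as follows. The data are simulated, the column names are assumptions, and `lm` is only a stand-in learner passed through the hypothetical `fit_fn` argument; a small wrapper around `randomForest::randomForest(..., ntree = 250)` could be passed instead to mirror the setup described above.

```r
set.seed(7)

# Toy long-format data: one row per (day, complaint type).
days  <- 1:60
types <- c("HEATING", "NOISE", "STREET CONDITION")
calls <- expand.grid(day = days, type = types, stringsAsFactors = FALSE)
calls$temp   <- rep(rnorm(length(days), 50, 15), times = length(types))
calls$volume <- round(200 - 2 * calls$temp * (calls$type == "HEATING") +
                        rnorm(nrow(calls), sd = 5))

train <- calls[calls$day <= 45, ]
test  <- calls[calls$day > 45, ]

# Fit one model per complaint type; fit_fn is a stand-in learner.
fit_by_type <- function(data, fit_fn = lm) {
  lapply(split(data, data$type), function(d) fit_fn(volume ~ temp, data = d))
}

models <- fit_by_type(train)

# Predict each type separately, then sum the per-type predictions by day.
test$pred <- NA_real_
for (tp in names(models)) {
  rows <- test$type == tp
  test$pred[rows] <- predict(models[[tp]], newdata = test[rows, ])
}
total_pred <- tapply(test$pred, test$day, sum)
total_true <- tapply(test$volume, test$day, sum)
```

Because each model only sees one complaint type, it can pick up type-specific effects (here, only heating responds to temperature) that a single aggregate model would blur.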

We can see how the random forests, when trained over a more selective data set are able to better identify repeatable patterns in the data. The built up index of total call volume is reasonably well predicted now by the composite trees themselves! While the built-up aggregate predictor still has difficulty with these extreme days, the day-to-day fluctuations are well understood by the forests. This behavior is reflective of random forest’s tendency to regress towards the mean. While we were able to successfully correct the bias in our regressions, the trees will still continue to have a difficult time finding the structure for such extremities. One might consider a copula method or some density estimation to quantify the return periods of these events.

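If we run a simple OLS of observed on predicted volume over a trimmed version of the data (days with between 2,500 and 7,500 observed calls), we find an encouraging R-squared of about 0.68 and a slope of about 0.83 relating fitted to observed values.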

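# Keep only days whose observed volume lies between 2,500 and 7,500 calls
ind <- trueVolume > 2500 & trueVolume < 7500
# Regress observed volume (Y) on the aggregate forest prediction (X)
summary(lm(Y ~ X, data.frame(Y=trueVolume[ind], X=predVolume[ind])))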
## 
## Call:
## lm(formula = Y ~ X, data = data.frame(Y = trueVolume[ind], X = predVolume[ind]))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2063.89  -342.05   -32.12   298.10  3096.39 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 608.78601  163.86314   3.715 0.000236 ***
## X             0.83282    0.03056  27.251  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 616.2 on 355 degrees of freedom
## Multiple R-squared:  0.6766, Adjusted R-squared:  0.6757 
## F-statistic: 742.6 on 1 and 355 DF,  p-value: < 2.2e-16

Feature Reduction

Generating 50 or more forests to make a single day's prediction, while possible, is perhaps unrefined. The data itself suggests that there may be strong correlations between complaint types in how they manifest in call volumes. In this section, we use visual explorations of the correlation matrix and exploratory factor analysis (EFA) to home in on complaint types which may be best considered as one group.

Exploring Correlations of Call Volume

To expand our scope before homing in on potential macro-structures, we present the correlation matrix for the top 70 of 255 complaint types. Ordering the matrix hierarchically, it becomes apparent that there are about 5-7 strongly linked groups or factors, ranging in size from 2 to 10 or so complaint types.
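
A hierarchically ordered correlation matrix of this kind can be produced with base R alone. The sketch below uses six simulated series in two latent groups; the series names, the `1 - r` dissimilarity and the average linkage are assumptions for the example rather than the exact settings behind our figure.

```r
set.seed(3)

# Toy daily counts for six complaint types in two latent groups.
n <- 200
housing <- rnorm(n)
street  <- rnorm(n)
counts <- cbind(
  HEATING  = housing + rnorm(n, sd = 0.3),
  PLUMBING = housing + rnorm(n, sd = 0.3),
  PAINT    = housing + rnorm(n, sd = 0.3),
  POTHOLE  = street  + rnorm(n, sd = 0.3),
  SIGNAL   = street  + rnorm(n, sd = 0.3),
  LIGHT    = street  + rnorm(n, sd = 0.3)
)

# Correlation between complaint types (columns) across days (rows).
cmat <- cor(counts)

# Cluster on the dissimilarity 1 - r so highly correlated types merge first.
hc <- hclust(as.dist(1 - cmat), method = "average")

# Reorder rows and columns so that clusters appear as blocks on the diagonal.
ordered <- cmat[hc$order, hc$order]

# Cutting the tree gives an explicit group label per complaint type.
groups <- cutree(hc, k = 2)
```

Plotting `ordered` as a heatmap shows the blocks directly, and `groups` turns the visual clusters into labels that can be reused downstream.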

Fig - Hierarchical Clustering of Top 70 Complaints.

The eye is naturally drawn to a few clustered factors, but it is also apparent that 3 complaint types are strongly anti-correlated with several of the factors. NONCONST, GENERAL CONSTRUCTION, and PAINT - PLASTER are all strongly correlated amongst themselves but negatively correlated with a group of complaint types that includes PAINT/PLASTER. PAINT - PLASTER and PAINT/PLASTER are both labels used by the agency HPD, so the anti-correlation may arise from the two labels being used on alternating days. Other complaint types, however, are unique to an agency even when their names sound similar.

Exploratory Factor Analysis on Call Volume

Using the Promax rotation, we generate 4 factors from an EFA on the data matrix of the top 70 complaint types. The leading factor captures 19% of the variance, and its largest components are associated with the agency HPD, centering around complaints related to a specific unit or the conditions of the living space. These include Heat hot water, safety, animal abuse, floor/stair, and door/window. The next factor captures 16.5% of the variance. It is driven mainly by the HPD and DOB and seems to be associated with complaints about the infrastructure of a building or living conditions. The third factor, tying in 10.3% of the variance, has components related to dead trees, public space maintenance and NYPD noise complaints. The last key factor contributes much less to the total variance, only 6.2%. Furthermore, this factor is much less interpretable and is linked to literature requests. Beyond these four factors, the contribution to variance falls off quickly.

We have reasonably good agreement between the EFA and the clustered factors from the correlation matrix. Both approaches quickly pull out groupings related to housing, both within the apartment and in the building itself, as well as to the public space where pedestrians usually walk.
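
For readers who want to try this kind of analysis, base R's `factanal` supports a promax rotation directly. The sketch below runs it on simulated data with a known four-factor structure; the variable names, sample size and loadings are made up, and the variance shares it computes are only approximate under an oblique rotation.

```r
set.seed(11)

# Toy data: 12 complaint-type series driven by 4 latent factors.
n <- 1000
k <- 4
per <- 3
latent <- matrix(rnorm(n * k), n, k)
x <- do.call(cbind, lapply(seq_len(k), function(j) {
  sapply(seq_len(per), function(i) 0.8 * latent[, j] + rnorm(n, sd = 0.6))
}))
colnames(x) <- paste0("type", seq_len(ncol(x)))

# Maximum-likelihood factor analysis with an oblique promax rotation.
efa <- factanal(x, factors = k, rotation = "promax")

# Loadings: rows are complaint types, columns are factors.
L <- unclass(efa$loadings)

# Rough share of variance per factor from the sum of squared loadings.
var_share <- colSums(L^2) / nrow(L)

# Complaint type loading most heavily on each factor.
top_type <- apply(abs(L), 2, function(col) rownames(L)[which.max(col)])
```

Reading down each column of `L` for the largest absolute loadings is how a factor gets its interpretation, for example "in-unit housing conditions" for a factor dominated by HPD complaint types.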

Random Forest Reconstructions on Groupings

Using the correlation matrix to guide some intuitive groupings we attempted to reconstruct call volume on a reduced set of “meta” complaint types. The graph below represents a random forest being trained on whole subsets of complaints that are deemed highly correlated. We can see that forgoing the process of training on every single complaint type is not very detrimental to our overall predictive performance.

As an extension to the random forest regression model, we decided to investigate how complaints are distributed across the different zip codes of New York City. To predict this, we created a matrix with dates (January 1st to December 31st) as rows and zip codes as columns, where each entry corresponds to the probability that a complaint of a specific type falls in that zip code on that day of the year. Laplace smoothing was then applied to ensure that the probability of a new complaint occurring in a given region is never zero. The probability matrix was then multiplied by the complaint volume predicted by the random forest model to obtain the distribution of complaints by zip code. To validate the model, we used 2010-2014 as training data and 2015 complaints as test data. Three complaint types were randomly selected to gauge how our model fares: taxi, noise and plumbing complaints. For each of the three complaint types, three choropleth maps were produced: one shows the actual distribution of complaints in 2015, the next shows the predicted distribution of complaints in 2015, and the last shows the difference between the two (i.e. where our model did less well).
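
The smoothing and allocation steps can be sketched on a tiny count matrix. The counts, zip codes, predicted totals and the smoothing parameter `alpha = 1` below are all illustrative assumptions.

```r
# Toy training counts for one complaint type: rows are days of the year,
# columns are zip codes.
counts <- matrix(c(8, 2, 0,
                   0, 0, 0,
                   3, 3, 4),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("01-01", "01-02", "01-03"),
                                 c("10001", "11201", "10451")))

# Additive (Laplace) smoothing: add alpha to every cell before normalising
# each row, so no zip code ever receives probability zero.
laplace_probs <- function(counts, alpha = 1) {
  (counts + alpha) / (rowSums(counts) + alpha * ncol(counts))
}

probs <- laplace_probs(counts, alpha = 1)

# Predicted daily totals for the same complaint type (stand-in numbers for
# the random forest output), spread across zip codes row by row.
pred_volume <- c(110, 30, 60)
pred_by_zip <- probs * pred_volume
```

Note that a day with no training observations at all (the second row) falls back to a uniform split across zip codes, which is exactly the behavior smoothing is meant to provide.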

Noise Complaints

The majority of the noise complaints in New York City are concentrated in Manhattan, which is captured on the predicted map. The difference choropleth map also shows that the prediction fared well in the areas with many noise complaints.

Taxi Complaints

The majority of the taxi complaints in New York City are concentrated around Manhattan, which is also captured on the predicted map. The difference choropleth map also shows that the prediction fared well in the areas with many taxi complaints.

Plumbing Complaints

The majority of the plumbing complaints in New York City are concentrated around Brooklyn and the Bronx, which is also captured on the predicted map. The difference choropleth map also shows that the prediction fared well in the areas with the most plumbing complaints.

In addition to the three choropleth maps, two scatterplots are displayed below: one shows the error rate of the model as a function of the number of observations, and the other displays how we varied the smoothing parameter to minimize the mean squared error of the model.

As is shown above, our model fares very well overall, but seems to be very volatile when the number of observations in a specific region is small. In regions where a large number of complaints were observed (e.g. taxi complaints in Manhattan), the model is very accurate, and the performance is clearly seen by comparing the blue choropleth map to the actual number of observed complaints.
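
Choosing the smoothing parameter by held-out error amounts to a simple grid search. The sketch below does this for a single row of hypothetical counts; the numbers, the grid of candidate values and the use of the true test-day total (rather than a forest prediction) are assumptions for illustration.

```r
# Hypothetical training and held-out counts for one day-of-year row,
# three zip codes.
train_counts <- c(40, 9, 0)
test_counts  <- c(35, 12, 3)

# Held-out mean squared error of the smoothed allocation for a given alpha.
mse_for_alpha <- function(alpha) {
  probs <- (train_counts + alpha) /
    (sum(train_counts) + alpha * length(train_counts))
  pred <- probs * sum(test_counts)   # allocate the known test-day total
  mean((pred - test_counts)^2)
}

# Evaluate a small grid of candidate values and keep the best one.
alphas <- c(0, 0.5, 1, 2, 5, 10)
mse    <- sapply(alphas, mse_for_alpha)
best_alpha <- alphas[which.min(mse)]
```

Too little smoothing leaves sparse zip codes at or near zero, while too much flattens the distribution toward uniform; the error curve over `alphas` makes that trade-off visible.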

Graffiti, a controversial complaint

In the previous analysis and correlation matrix, we showed clusterings of multiple complaints that make intuitive sense. There was, however, one specific complaint that appears uncorrelated with every other type of complaint: graffiti. To explore why and how graffiti is such an interesting case by itself, we dedicate this section to several factors involving graffiti.

Graffiti, icon for hipster culture

Graffiti is perhaps the most controversial type of complaint in the 311 data. Most people would agree that vehicle noise or broken heating is a pain to the community, but graffiti has long been a topic of debate. Certainly many people find the vandalism and nudity in mural paintings disturbing, but plenty of others treat them as art that represents hipster culture. In fact, when considering the neighborhoods that experienced a sharp increase in rental prices in the past decade, one would find that these neighborhoods are more or less associated with a young, hipster dynamic. A causal explanation seems reasonable: young people, together with graffiti, bring energy to a neighborhood and make it trendier. That is usually followed by new restaurants, bars and entertainment venues, which in turn drive housing prices higher.

Hence our first exploration is on the neighborhoods’ rental prices and graffiti complaints. Using data from zillows.com, we parsed the average rental price in each neighborhood over the past 5 years. We then plot, in a streamgraph, the number of graffiti complaints in the 10 neighborhoods with the biggest housing price surge.

Graffiti complaints in the 10 neighborhoods with surging housing prices

As one moves the mouse over each stream, it shows the name of the neighborhood and the number of graffiti complaints received each year. We see names we would expect, such as Williamsburg. The most interesting insight from this graph is that the only neighborhood that experienced an increase in graffiti is East New York & New Lots, a neighborhood that doubled its graffiti density from 2013 to 2014.

East New York: only neighborhood with increase in graffiti.

More insight can be extracted when we plot the graffiti locations in the neighborhood. In the following graph, the colors represent the days it takes to close the case. That is to say, the darker the color, the longer the graffiti lasts.

Tough graffiti in New Lots

As one can easily see on the map, most of the longer-lasting graffiti is along the tracks of the J and Z lines. That makes sense: in New Lots the subway runs above ground, so graffiti along the tracks has better public exposure, which in turn leads to a longer life span.

We also found a news article identifying New Lots as the next trendy neighborhood. Together with the above-ground graffiti and hipster culture, it seems we have a good reason to explain the correlation between graffiti complaints and housing prices. From a Bayesian network perspective, hipster culture would be the parent of two children: graffiti and housing price.

News on New Lots.

Gentrification of the neighborhood drives graffiti out

Unfortunately, the same phenomenon that brings more attention to graffiti can sometimes turn out to be bad news for the artists. In early 2014, a similar article identified Astoria as a top neighborhood in NY for artists. In that case, a different story took place: housing prices went up so fast that the neighborhood soon became unaffordable for graffiti artists. In the graph below we clearly see how gentrification forced the artists out.

News on Astoria.

Astoria: graffiti and housing price

In the above analysis, we saw how gentrification in areas outside of Manhattan has influenced graffiti’s popularity. Will New Lots become the next Astoria, where surging housing prices eventually drive the artists out and force them to find the next yet-to-be-trendy block? Maybe. As a side note, in the streamgraph above, SoHo has already experienced a drop in graffiti complaints in the past 3 years, perhaps mirroring what happened in Astoria.

Now we turn our attention to graffiti within Manhattan, where it does not have the luxury of exposure along above-ground subway tracks. In analyzing what affects graffiti in the city, we ask ourselves: who are the people filing complaints about graffiti? After all, graffiti does not give us sleepless nights or freezing apartments that threaten our basic needs. Who would bother filing complaints about some perhaps grotesque images or blasphemous phrases on a wall?

Parents, the concerned audience

Our hypothesis is that parents with young kids are likely to file complaints about graffiti, so as to prevent their kids’ exposure to nudity, vandalism or controversial icons. Because it is hard to access data on the distribution of parents across neighborhoods in Manhattan, we used a proxy: the locations of child care centers in the city. Intuitively, more child care centers in a block suggest a higher density of toddlers and babies. The data source is https://a816-healthpsi.nyc.gov/ChildCare/

More kids nearby hastens graffiti removal

We engineered a feature by counting the number of child care centers within 3 blocks of where each graffiti complaint was filed. Then we broke down the data by the number of child care centers. For each level of child care density, we plot the distribution of the number of days graffiti lasts in the neighborhood (calculated as closed date - start date).

As we can see from the graph, the average time it takes to remove graffiti drops from about 5 months to 2 months as the number of nearby child care centers increases. It does seem that parents act as an extra pressure on graffiti’s survival in Manhattan.
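
A minimal version of this feature can be computed with base R as sketched below. The coordinates and dates are made up, the column names are assumptions, and the 250 m radius is only a rough stand-in for "3 blocks".

```r
# Toy graffiti complaints and child care center locations.
graffiti <- data.frame(
  lat     = c(40.7300, 40.7800),
  lon     = c(-73.9900, -73.9500),
  created = as.Date(c("2015-01-10", "2015-02-01")),
  closed  = as.Date(c("2015-03-11", "2015-07-01"))
)
centers <- data.frame(
  lat = c(40.7305, 40.7310, 40.7600),
  lon = c(-73.9895, -73.9910, -73.9700)
)

# Approximate distance in metres (equirectangular; fine at city scale).
dist_m <- function(lat1, lon1, lat2, lon2) {
  r <- 6371000
  x <- (lon2 - lon1) * pi / 180 * cos((lat1 + lat2) / 2 * pi / 180)
  y <- (lat2 - lat1) * pi / 180
  r * sqrt(x^2 + y^2)
}

# Number of child care centers within the radius of each complaint.
radius <- 250
graffiti$n_centers <- sapply(seq_len(nrow(graffiti)), function(i) {
  d <- dist_m(graffiti$lat[i], graffiti$lon[i], centers$lat, centers$lon)
  sum(d <= radius)
})

# Life span of each graffiti report in days (closed date - start date).
graffiti$days_open <- as.numeric(graffiti$closed - graffiti$created)
```

Grouping `days_open` by `n_centers` (for example with `boxplot(days_open ~ n_centers, data = graffiti)` on the full data) gives one distribution per level of child care density.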

Child care and graffiti life span

Galleries, a refuge for graffiti

So far we have seen some bad news for graffiti. Now we want to explore the other side, namely whether anything indicates that graffiti will last longer. We got our inspiration from the history of graffiti in the city: the government declared a war on graffiti in the 80s to eradicate the subculture icon, in line with the ‘Fixing Broken Windows’ philosophy (which holds that graffiti and crime are positively correlated). That was indeed an era when graffiti faced a much tougher situation than today. It was the emerging trend amongst galleries that served as a lifeline for graffiti. Galleries in New York started to embrace street art as a modern icon for the first time in the 90s. This movement greatly helped graffiti’s survival.

To check whether galleries still function as a refuge for graffiti today, we collected data on galleries in the city (https://data.cityofnewyork.us/Recreation/NYC-Art-Galleries/dqn4-tbkx). We can see obvious clustering in the following heat map. The three clusters are Chelsea, the Upper East Side and SoHo, each indeed known for its galleries.

Conclusion on graffiti complaints

With hypotheses built from history, common sense and daily observations, we explored various aspects of graffiti complaints and some potential causal explanations. What we have here is more of a survival guide for graffiti artists in New York:

To have your graffiti last longer and get more public exposure, pick a neighborhood that is just starting to see its hipster population and housing prices increase. Paint near an above-ground subway station if possible, as more people will see your work and it will last longer. Chances are, though, that the neighborhood will soon get too expensive for you to afford. You have some options, one of which is moving to Manhattan and painting in a block packed with galleries. However, be mindful of the concerned parents who would do everything to remove your work, including calling 311.

References

Zha, Yilong and Manuela Veloso. “Profiling and Prediction of Non-Emergency Calls in New York City.” Association for the Advancement of Artificial Intelligence, 2014: 41-47. Print

Zhang, Guoyi and Yan Lu. “Bias-corrected random forests in regression.” Journal of Applied Statistics. 19 May 2011